nyc taxi-nongeo ¶
Load NYC Taxi data ¶
(takes a dozen seconds or so...)
import pandas as pd
df = pd.read_csv('data/nyc_taxi.csv',usecols=['trip_distance','fare_amount','tip_amount','passenger_count'])
df.tail()
Define a simple plot ¶
from bokeh.plotting import figure, output_notebook, show
output_notebook()
def base_plot():
    p = figure(
        x_range=(0, 20),
        y_range=(0, 40),
        tools='pan,wheel_zoom,box_zoom,reset',
        plot_width=800,
        plot_height=500,
    )
    p.xgrid.grid_line_color = None
    p.ygrid.grid_line_color = None
    p.xaxis.axis_label = "Distance, miles"
    p.yaxis.axis_label = "Fare, $"
    p.xaxis.axis_label_text_font_size = '12pt'
    p.yaxis.axis_label_text_font_size = '12pt'
    return p
options = dict(line_color=None, fill_color='blue', size=5)
1000 points reveal the expected linear relationship ¶
samples = df.sample(n=1000)
p = base_plot()
p.circle(x=samples['trip_distance'], y=samples['fare_amount'], **options)
show(p)
10,000 points show more detailed, systematic patterns in fares and distances ¶
Perhaps there are different metering options, along with granularity in how distances and fares are counted; in any case, the distances and fares do not uniformly populate any region of this space:
options = dict(line_color='blue', fill_color='blue', size=1, alpha=0.05)
samples = df.sample(n=10000)
p = base_plot()
p.circle(x=samples['trip_distance'], y=samples['fare_amount'], **options)
show(p)
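One way to probe the suspected granularity numerically is to measure what fraction of fares fall on a fixed grid. The sketch below is only an illustration: `fraction_on_grid` is a hypothetical helper, and `example_fares` is a small made-up array standing in for `df['fare_amount']`.

```python
import numpy as np

def fraction_on_grid(values, step=0.5, tol=1e-9):
    """Return the fraction of values that are whole multiples of `step`."""
    values = np.asarray(values, dtype=float)
    scaled = values / step
    # Distance from each scaled value to its nearest integer
    remainder = np.abs(scaled - np.round(scaled))
    return float(np.mean(remainder < tol))

# Made-up stand-in for df['fare_amount']; replace with the real column.
example_fares = np.array([4.5, 6.0, 7.5, 8.0, 12.5, 9.73, 52.0, 3.3])
on_grid = fraction_on_grid(example_fares, step=0.5)
print(on_grid)
```

A value close to 1 on the real column would suggest that fares are recorded in steps of `step`; trying several step sizes can hint at which grid, if any, the meter uses.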
Datashader reveals additional detail, especially when zooming in ¶
You can now see that there are a lot of points below the linear boundary, representing long trips for very little cost (presumably GPS errors?).
import datashader as ds
from datashader.bokeh_ext import InteractiveImage
p = base_plot()
pipeline = ds.Pipeline(df, ds.Point("trip_distance", "fare_amount"))
InteractiveImage(p, pipeline)
Here we're using the default histogram-equalized color mapping function to reveal density differences across this space. If we use a linear mapping instead, we can mainly see that there are a lot of values near the origin, while all the rest are colored with the same minimum color (light blue by default):
from datashader import transfer_functions as tf
import functools as ft
color_fn = ft.partial(tf.shade,how='linear')
p = base_plot()
pipeline = ds.Pipeline(df, ds.Point("trip_distance", "fare_amount"), color_fn=color_fn)
InteractiveImage(p, pipeline)
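To see why a linear mapping hides most of the structure, it can help to compare the two scalings on a small, heavily skewed set of counts. The sketch below is not Datashader's implementation: `eq_hist_scale` is a simplified, rank-based stand-in for histogram equalization, and the `counts` array is made up.

```python
import numpy as np

def linear_scale(counts):
    """Rescale counts linearly to the range [0, 1]."""
    counts = np.asarray(counts, dtype=float)
    return (counts - counts.min()) / (counts.max() - counts.min())

def eq_hist_scale(counts):
    """Map each count to its empirical CDF position (a simplified,
    rank-based approximation of histogram equalization)."""
    counts = np.asarray(counts, dtype=float)
    sorted_vals = np.sort(counts.ravel())
    # Number of values <= each count, divided by the total number of values
    return np.searchsorted(sorted_vals, counts, side='right') / sorted_vals.size

# Made-up per-pixel counts: one very dense pixel dominates the range.
counts = np.array([1, 2, 4, 8, 16, 10000])
lin = linear_scale(counts)
eq = eq_hist_scale(counts)
print(lin)
print(eq)
```

Under the linear scaling every pixel except the densest lands within a fraction of a percent of the minimum, so they all receive nearly the same color, whereas the rank-based scaling spreads them evenly across the color range.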
Fares are discretized to the nearest 50 cents, making patterns less visible, but there is both an upward trend in tips as fares increase (as expected) and a large number of tips higher than the fare itself, which is surprising:
p = base_plot()
p.xaxis.axis_label = "Fare, $"
p.yaxis.axis_label = "Tip, $"
pipeline = ds.Pipeline(df, ds.Point("fare_amount", "tip_amount"))
InteractiveImage(p, pipeline)
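The number of tips that exceed the fare can also be counted directly with a boolean mask. This is a minimal sketch on a made-up five-row frame, `trips`, standing in for the real `df`; the numbers say nothing about the actual dataset.

```python
import pandas as pd

# Made-up miniature stand-in for the taxi DataFrame.
trips = pd.DataFrame({
    'fare_amount': [5.0, 10.0, 7.5, 20.0, 4.0],
    'tip_amount':  [1.0, 12.0, 0.0, 4.0, 6.5],
})

# Boolean mask: True where the tip is larger than the fare
tip_exceeds_fare = trips['tip_amount'] > trips['fare_amount']

n_large_tips = int(tip_exceeds_fare.sum())        # how many such trips
share_large_tips = float(tip_exceeds_fare.mean())  # as a fraction of all trips
print(n_large_tips, share_large_tips)
```

Applying the same mask to the full `df` would give the size of the surprising region above the diagonal in the plot.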
Interestingly, tips go down when the number of passengers is greater than 1:
import datashader as ds
from datashader.bokeh_ext import InteractiveImage
from bokeh.models import Range1d
p = base_plot()
p.xaxis.axis_label = "Passengers"
p.yaxis.axis_label = "Tip, $"
p.x_range = Range1d(-0.5, 6.5)
p.y_range = Range1d(0, 60)
pipeline = ds.Pipeline(df, ds.Point("passenger_count", "tip_amount"), width_scale=0.035)
InteractiveImage(p, pipeline)
Here we've reduced the resolution along the x axis so that instead of getting isolated points for this inherently discrete data, you can see more-visible horizontal line segments.
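The relationship between passenger count and tip can be cross-checked with a plain group-by summary. The sketch below uses a made-up frame, `trips`, in place of `df`, so its numbers are purely illustrative and do not confirm or refute the pattern seen in the plot.

```python
import pandas as pd

# Made-up miniature stand-in for the taxi DataFrame.
trips = pd.DataFrame({
    'passenger_count': [1, 1, 1, 2, 2, 3, 5],
    'tip_amount':      [2.0, 3.0, 4.0, 1.0, 2.0, 1.5, 0.5],
})

# Summarize tips separately for each passenger count
tip_by_passengers = (
    trips.groupby('passenger_count')['tip_amount']
         .agg(['mean', 'median', 'count'])
)
print(tip_by_passengers)
```

Including `count` alongside the mean and median is a deliberate choice: groups with few trips can produce misleading averages, which matters when comparing single riders against larger parties.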
The above plots all use Bokeh directly, but a much wider range of interactive plots can be built easily using the separate HoloViews library, which builds Bokeh and Matplotlib plots from high-level specifications. For instance, Datashader currently only provides 2D aggregates, but you can easily make a zoomable one-dimensional histogram using HoloViews to dynamically collapse across a second dimension:
result = None
try:
    import numpy as np
    import holoviews as hv
    from holoviews.operation.datashader import aggregate
    hv.notebook_extension('bokeh')
    %opts Curve [width=800]
    dataset = hv.Dataset(df, kdims=['fare_amount', 'trip_distance'], vdims=[]).select(fare_amount=(0,60))
    agg = aggregate(dataset, aggregator=ds.count(), streams=[hv.streams.RangeX()], x_sampling=0.5, width=500, height=2)
    result = agg.map(lambda x: x.reduce(trip_distance=np.sum), hv.Image)
except ImportError:
    pass
result
Here datashader is aggregating over both fare_amount and trip_distance, but trip_distance was specified to have only a height of 2, because it will be further collapsed to create the histogram being displayed. You can now use the wheel zoom tool when hovering over the x axis, and the plot will zoom in or out, dynamically resampling at the given location to make a new histogram (as long as there is a live Python server running).
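The idea of collapsing a coarse 2D aggregate into a 1D histogram can be illustrated with plain NumPy, independent of Datashader and HoloViews. In this sketch the data are synthetic (the gamma and uniform parameters are arbitrary assumptions), and the two distance bins play the role of the `height=2` aggregate.

```python
import numpy as np

# Synthetic stand-ins for fare_amount and trip_distance (arbitrary parameters).
rng = np.random.default_rng(0)
fares = rng.gamma(shape=3.0, scale=4.0, size=10_000)
distances = rng.uniform(0, 20, size=10_000)

fare_edges = np.arange(0, 60.5, 0.5)       # 0.5-wide fare bins from 0 to 60
dist_edges = np.array([0.0, 10.0, 20.0])   # only two bins along distance

# 2D counts with shape (n_fare_bins, 2), analogous to a height-2 aggregate
counts_2d, _, _ = np.histogram2d(fares, distances, bins=[fare_edges, dist_edges])

# Summing over the distance axis collapses the aggregate to a fare histogram
hist_from_2d = counts_2d.sum(axis=1)

# The same histogram computed directly, for comparison
hist_direct, _ = np.histogram(fares, bins=fare_edges)
print(np.array_equal(hist_from_2d, hist_direct))
```

Because every synthetic distance falls inside the distance range, summing across the second axis loses nothing, and the collapsed result matches the direct histogram bin for bin.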
In this particular plot, there is a very wide range of fare amounts, with an implausibly high maximum fare of over \$4000. You can easily zoom in to the bulk of the data to see that nearly all fares are between \$4 and \$20, following something like a gamma distribution, and that they are discretized to the nearest \$0.50 in this dataset.